Abstract


The method of maximum-likelihood is typically applied to item response theory (IRT) models 
when the ability parameter is estimated while conditioning on the true item parameters. In 
practice, the item parameters are unknown and need to be estimated first from a calibration 
sample. Lewis (1985) and Zhang and Lu (2007) proposed the expected response functions (ERFs) 
and the corrected weighted-likelihood estimator (CWLE), respectively, to take into account the 
uncertainty regarding item parameters for purposes of ability estimation. In this paper, we 
investigate the performance of ERFs and of the CWLE in different situations, such as various 
test lengths and levels of measurement error in item parameter estimation. Our empirical results 
indicate that ERFs can cause the bias in ability estimation to fall within [-0.2, 0.2] for all
conditions, whereas the CWLE can effectively reduce the bias in ability estimation provided that 
it has a good foundation to start from. 


Key words: Item response theory, maximum-likelihood estimator, expected response functions, 
measurement-error modeling, weighted-likelihood estimator, corrected weighted-likelihood 
estimator 
1 Introduction


Traditionally, the maximum-likelihood method is applied to item response theory 
(IRT) models when the examinee ability parameter $\theta$ is estimated, which results in the
maximum-likelihood estimator (MLE) $\hat{\theta}_m$. In that case, item parameters are usually estimated
first from a calibration sample, and then the MLEs of the $\theta$'s in the target sample are calculated with
the estimated item parameters treated as fixed and known. When the item parameter estimation 
is both accurate and precise, replacing the real item parameters with the estimated ones is not 
unreasonable, and it is acceptable to estimate the ability parameters in the regular way. Under 
the assumption that the real item parameters are known, both Lord (1983) and Warm (1989) 
proposed corrections of $\hat{\theta}_m$ for bias based on the asymptotic expansion. It seems reasonable to
assume that, if the uncertainty of item parameters does not introduce extra biases in $\hat{\theta}_m$, then
these corrections should remain applicable, too. 

However, it has been found that, when the estimated item parameters are used as 
substitutes for the real ones, neither the MLE with Lord bias-correction (MLE-LBC) nor Warm’s 
weighted-likelihood estimator (WLE) will be as effective as they are supposed to be, especially 
when employing the 3PL model (Zhang, 2005). As a result, the measurement error in item 
parameter estimation must be considered a potential contaminator to the ability estimation 
process as well. Lewis (1985) came up with the idea of expected response functions (ERFs), which 
incorporate the uncertainty of item parameters by averaging out the noise induced by estimation
in the item response functions (IRFs). Along this line, Mislevy, Wingersky, and Sheehan (1994) 
provided the operational procedures for applied work with ERFs. On the other hand, Song (2003) 
and Zhang, Xie, Song, and Lu (2007) derived the bias-correction formulas for the MLE and WLE 
of $\theta$ by asymptotic expansion with imperfectly estimated item parameters under certain regularity
conditions. This method is called measurement-error modeling. Zhang and Lu (2007) embraced 
the thinking behind measurement-error modeling and proposed the corrected weighted-likelihood 
estimator (CWLE) method. 

To obtain an accurate ability estimate when employing the 3PL model, measurement error 
in ability estimation and in item parameter estimation clearly has to be taken into account, but 
how these methods compare to each other is of concern. Accordingly, the main purpose of this
study is to investigate the performance of ERFs and of the CWLE in different situations, such as 
various test lengths and levels of measurement error in item parameter estimation. It is also of 
practical interest to know when these bias-correction procedures should be applied. 

2 Methods 


In this section, the two bias-correction procedures are briefly introduced. Suppose there
is a test with $N$ dichotomously scored 3PL items, where $X_n$ is the score of a randomly selected
examinee on item $n$ in the calibration sample and $Y_n$ is the score of an examinee on item $n$ in the
target sample. (Examinees in the calibration sample are not necessarily the same group of people
as those in the target sample.) Let $\beta = (a, b, c)$ denote the vector of item parameters. The 3PL
IRF is defined as the probability of answering an item correctly by a randomly selected examinee
with ability $\theta$, that is,

$$F(\theta; \beta) = c + (1 - c)\,\frac{1}{1 + \exp\{-1.7a(\theta - b)\}}. \tag{1}$$
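As a minimal illustration of Equation 1, the following Python sketch evaluates the 3PL IRF for a single item. The function name `irf_3pl` and the example parameter values are illustrative assumptions and are not taken from the study.

```python
import math


def irf_3pl(theta, a, b, c):
    """3PL item response function with the 1.7 scaling constant (Equation 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


# At theta == b the logistic part equals 0.5, so F = c + (1 - c) / 2.
print(irf_3pl(0.0, 1.0, 0.0, 0.25))  # 0.625
```

The lower asymptote of the curve is c, which is why very low abilities still yield a success probability close to the guessing parameter.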


2.1 Expected Response Functions 

A corresponding ERF is defined as

$$F^*(\theta) = E_\beta[F(\theta; \beta)] = \int F(\theta; \beta)\, p(\beta)\, d\beta, \tag{2}$$

where $p(\beta) = p(\beta \mid x)$ is the posterior distribution of $\beta$ with prior knowledge from the calibration
sample. This ERF will be used to replace the original IRF in the likelihood function.

Because of the integration, it is computationally intensive to take expectations whenever
the ERFs are calculated. Thus, Mislevy et al. (1994) described an operational procedure as an
alternative, which is the following:

1. Obtain an estimate of the posterior distribution $p(\beta_n)$, $n = 1, \ldots, N$.

2. Specify a grid of $J$ $\theta$ values across the ability range of interest. Let $\theta_j$ denote the $j$th grid
point, $j = 1, \ldots, J$.

3. Draw $K$ item parameter vectors from $p(\beta_n)$. Let $\beta_n^{(k)}$ be the $k$th such draw.

4. For each of the $K$ sets of item parameters, determine $P_{nj}^{(k)}$, the probability of a correct
response to item $n$ at $\theta_j$, where

$$P_{nj}^{(k)} = P(y_n = 1 \mid \theta = \theta_j, \beta_n = \beta_n^{(k)}). \tag{3}$$

5. Compute the expectation at each point $\theta_j$ by averaging the probabilities obtained in Step 4:

$$F_n^*(\theta_j) = \frac{1}{K} \sum_{k=1}^{K} P_{nj}^{(k)}. \tag{4}$$

For the $n$th item, the collection of points $\{(\theta_j, F_n^*(\theta_j)) : j = 1, \ldots, J\}$ is referred to as a
nonparametric ERF because it does not assume any parametric form. The nonparametric ERF is
further approximated by a close-fitting 3PL curve $F^{**}$. The MLEs for the 3PL item parameters
$\beta_n^{**} = (a_n^{**}, b_n^{**}, c_n^{**})$ that best approximate $F_n^*$ are found by maximizing

$$\prod_{j=1}^{J} \left\{ \left[F_n^{**}(\theta_j; \beta_n^{**})\right]^{F_n^*(\theta_j)} \left[1 - F_n^{**}(\theta_j; \beta_n^{**})\right]^{1 - F_n^*(\theta_j)} \right\}^{W_j} \tag{5}$$

over the $J$-point $\theta$ grid, where $W_j$ is a weight that specifies the relative importance of fitting
$F^{**}$ at $\theta_j$. The resulting 3PL approximation is referred to as a fitted ERF of the $n$th item. The
likelihood function is thus determined with the fitted ERFs of all items serving as substitutes for
the IRFs, and standard approaches to estimating ability parameters such as MLE and EAP can
be applied immediately.
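Steps 2 through 5 of the procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the posterior in Step 1 is replaced by independent normal noise on the (log a, b, logit c) scale, the standard deviations and item estimates are made-up values, and the function names are hypothetical. The final 3PL fit of Equation 5 is not implemented here.

```python
import math
import random


def irf_3pl(theta, a, b, c):
    """3PL item response function (Equation 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def draw_item_parameters(est, sd, k, rng):
    """Step 3: draw k vectors (a, b, c) around the estimates `est`.

    Assumption: independent normal noise on the (log a, b, logit c) scale,
    which keeps a > 0 and 0 < c < 1. `sd` holds the three standard deviations.
    """
    a_hat, b_hat, c_hat = est
    mu = (math.log(a_hat), b_hat, math.log(c_hat / (1.0 - c_hat)))
    draws = []
    for _ in range(k):
        la, b, lc = (rng.gauss(m, s) for m, s in zip(mu, sd))
        draws.append((math.exp(la), b, 1.0 / (1.0 + math.exp(-lc))))
    return draws


def nonparametric_erf(draws, grid):
    """Steps 4-5: average the IRF over the draws at every grid point."""
    return [sum(irf_3pl(t, *d) for d in draws) / len(draws) for t in grid]


rng = random.Random(0)
grid = [-4.0 + 0.5 * j for j in range(17)]  # Step 2: J = 17 grid points
draws = draw_item_parameters((1.2, 0.1, 0.25), (0.2, 0.3, 0.4), 500, rng)
erf = nonparametric_erf(draws, grid)  # nonparametric ERF values at the grid
```

When all three standard deviations are zero, every draw equals the estimate and the averaged curve coincides with the ordinary IRF, which is a convenient sanity check.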


2.2 Corrected Weighted-Likelihood Estimator 

The other method considered here is measurement-error modeling (Song, 2003; Zhang et 
al., 2007). In this method, the bias of $\hat{\theta}_m$ can be decomposed into two sources of measurement
error: One is the bias of $\hat{\theta}_m$ given item parameters, and the other is the bias resulting from the
uncertainty of item parameters. The former has been addressed by Lord’s MLE-LBC and by 
Warm’s WLE. If the latter can be quantified, subtracting it from either MLE-LBC or WLE should 
result in an unbiased (or less biased) estimate. In fact, Zhang and Lu (2007) made good use of 
the idea and proposed the CWLE method, which is described in the theorem that follows. 

Let $F_n(\theta) = F(\theta; \beta_n)$ be the probability of the $n$th item being answered correctly by a
randomly selected examinee with ability $\theta$. Under the assumption of local independence (Lord,
1980), the likelihood function for $\theta$ given the responses $y = \{y_1, \ldots, y_N\}$ is

$$L(\theta \mid y) = \prod_{n=1}^{N} [F_n(\theta)]^{y_n} [1 - F_n(\theta)]^{1 - y_n}. \tag{6}$$

If the item parameters $\beta_n$ are known, the MLE $\hat{\theta}_m$ is the maximizer of Equation 6, which also
satisfies the likelihood equation:

$$\frac{\partial \ln L(\theta \mid y)}{\partial \theta} = 1.7 \sum_{n=1}^{N} a_n K_n(\theta)\,(y_n - F_n(\theta)) = 0, \tag{7}$$

where

$$K_n(\theta) = K(\theta; a_n, b_n, c_n) = \frac{P_n(\theta)}{F_n(\theta)} = \frac{1}{1 + c_n \exp\{-1.7a_n(\theta - b_n)\}} \tag{8}$$

and

$$P_n(\theta) = \frac{1}{1 + \exp\{-1.7a_n(\theta - b_n)\}} \tag{9}$$

is the 2PL model. Let $Q_n(\theta) = 1 - P_n(\theta)$.

Let $I_n(\theta)$ be the item-information function for item $n$,

$$I_n(\theta) = 1.7^2 a_n^2 (1 - c_n) P_n(\theta) Q_n(\theta) K_n(\theta), \tag{10}$$

and let

$$I(\theta) = \sum_{n=1}^{N} I_n(\theta) \tag{11}$$

be the test-information function. Given item parameters, Lord (1983) applied the asymptotic
expansion to the likelihood equation and obtained the following bias function for $\hat{\theta}_m$:

$$B(\theta) = \frac{1.7}{I^2(\theta)} \sum_{n=1}^{N} a_n I_n(\theta) \left( P_n(\theta) - 0.5 \right). \tag{12}$$

The MLE-LBC of $\theta$ is defined as $\hat{\theta}_c = \hat{\theta}_m - B(\hat{\theta}_m)$.

Warm (1989) proposed the WLE based on Lord's work. The WLE, $\hat{\theta}_w$, is the maximizer
of the function $f(\theta)L(\theta \mid y)$, where $f(\theta)$ is a suitably chosen function satisfying

$$\frac{\partial \ln f(\theta)}{\partial \theta} = -B(\theta) I(\theta).$$

Therefore, $\hat{\theta}_w$ satisfies the following weighted-likelihood equation,

$$\frac{\partial \ln [f(\theta) L(\theta \mid y)]}{\partial \theta} = 1.7 \sum_{n=1}^{N} a_n K_n(\theta)\,(y_n - F_n(\theta)) - B(\theta) I(\theta) = 0.$$

The WLE can be shown to be less biased than the MLE with the same asymptotic variance and
normal distribution.
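The weighted-likelihood equation can be solved numerically once the item parameters are treated as known. The sketch below is one possible approach, not the procedure used in the study: it scans the interval [-4, 4] for a change of sign from positive to non-positive and then bisects. The function names, the scan range, and the step count are assumptions made for illustration, and the routine returns `None` when no such sign change is found (for example, for some perfect or zero scores).

```python
import math

D = 1.7  # scaling constant used throughout the text


def item_terms(theta, a, b, c):
    """Return F_n, P_n, K_n and I_n for one item at ability theta."""
    u = math.exp(-D * a * (theta - b))
    p = 1.0 / (1.0 + u)          # 2PL part P_n
    k = 1.0 / (1.0 + c * u)      # K_n = P_n / F_n
    f = c + (1.0 - c) * p        # 3PL IRF F_n
    info = D**2 * a**2 * (1.0 - c) * p * (1.0 - p) * k
    return f, p, k, info


def weighted_score(theta, items, y):
    """Left-hand side of the weighted-likelihood equation."""
    score = total_info = bias_sum = 0.0
    for (a, b, c), yn in zip(items, y):
        f, p, k, info = item_terms(theta, a, b, c)
        score += D * a * k * (yn - f)
        total_info += info
        bias_sum += a * info * (p - 0.5)
    # B(theta) * I(theta) simplifies to D * bias_sum / I(theta).
    return score - D * bias_sum / total_info


def wle(items, y, lo=-4.0, hi=4.0, steps=80):
    """Scan [lo, hi] for a + to - sign change, then bisect; None if absent."""
    width = (hi - lo) / steps
    for i in range(steps):
        left, right = lo + i * width, lo + (i + 1) * width
        if weighted_score(left, items, y) > 0 >= weighted_score(right, items, y):
            for _ in range(60):
                mid = 0.5 * (left + right)
                if weighted_score(mid, items, y) > 0:
                    left = mid
                else:
                    right = mid
            return 0.5 * (left + right)
    return None
```

For a symmetric two-item test with c = 0, difficulties -1 and 1, and responses (1, 0), both the score and the bias term vanish at zero, so the routine should return an estimate very close to 0.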

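In reality, the $\beta$'s are estimated in the calibration sample, and then their estimates, the $\hat{\beta}$'s, are
fixed as substitutes for the true $\beta$'s. When the maximum-likelihood method is applied to estimate
the examinees' $\theta$'s in the target sample, instead of Equation 7, $\hat{\theta}_m$ must satisfy

$$1.7 \sum_{n=1}^{N} \hat{a}_n \hat{K}_n(\theta)\,(y_n - \hat{F}_n(\theta)) = 0 \tag{14}$$

with $\hat{K}_n(\theta) = K(\theta; \hat{a}_n, \hat{b}_n, \hat{c}_n)$ and $\hat{F}_n(\theta) = F(\theta; \hat{a}_n, \hat{b}_n, \hat{c}_n)$. Similarly, $\hat{\theta}_w$ must satisfy

$$1.7 \sum_{n=1}^{N} \hat{a}_n \hat{K}_n(\theta)\,(y_n - \hat{F}_n(\theta)) - \hat{B}(\theta)\hat{I}(\theta) = 0. \tag{15}$$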

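To account for the uncertainty of item parameters, they are assumed to be measured with
errors. Let

$$
\begin{aligned}
E(\hat{a}_n) &= a_n + \delta_{an}, & E(\hat{b}_n) &= b_n + \delta_{bn}, & E(\hat{c}_n) &= c_n + \delta_{cn}, \\
\mathrm{Var}(\hat{a}_n) &= \sigma^2_{an}, & \mathrm{Var}(\hat{b}_n) &= \sigma^2_{bn}, & \mathrm{Var}(\hat{c}_n) &= \sigma^2_{cn}, \\
\mathrm{Cov}(\hat{a}_n, \hat{b}_n) &= \sigma_{abn}, & \mathrm{Cov}(\hat{b}_n, \hat{c}_n) &= \sigma_{bcn}, & \text{and } \mathrm{Cov}(\hat{a}_n, \hat{c}_n) &= \sigma_{acn},
\end{aligned}
$$

where $\delta_{an}$, $\delta_{bn}$, and $\delta_{cn}$ are the biases of $\hat{a}_n$, $\hat{b}_n$, and $\hat{c}_n$, respectively. $\sigma^2_{an}$, $\sigma^2_{bn}$, $\sigma^2_{cn}$, $\sigma_{abn}$, $\sigma_{bcn}$, and
$\sigma_{acn}$ are elements of the variance/covariance matrix of the triplet $(\hat{a}_n, \hat{b}_n, \hat{c}_n)$. No distributional
assumptions are needed, but the following regularity conditions are required to establish the whole
theory of CWLE: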

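(A0) Item parameters $a_n$ and $b_n$ are uniformly bounded, and $c_n$ is bounded away from 1. $\theta$ is a
bounded variable.

(A1) There exists $n_0$ such that for any $N > n_0$, $\lim_{M \to \infty} \sigma^2_{\max} = 0$, where $M$ is the calibration
sample size and $\sigma^2_{\max} = \max_{1 \le n \le N} \{\sigma^2_{an}, \sigma^2_{bn}, \sigma^2_{cn}, \delta^2_{an}, \delta^2_{bn}, \delta^2_{cn}\}$.

(A2)

$$
\begin{aligned}
\lim_{M \to \infty} \frac{1}{N} \sum_{n=1}^{N} \mathrm{Var}[(\hat{a}_n - a_n)^2] &= 0, \\
\lim_{M \to \infty} \frac{1}{N} \sum_{n=1}^{N} \mathrm{Var}[(\hat{b}_n - b_n)^2] &= 0, \\
\lim_{M \to \infty} \frac{1}{N} \sum_{n=1}^{N} \mathrm{Var}[(\hat{c}_n - c_n)^2] &= 0, \\
\lim_{M \to \infty} \frac{1}{N} \sum_{n=1}^{N} \mathrm{Var}[(\hat{a}_n - a_n)(\hat{b}_n - b_n)] &= 0, \\
\lim_{M \to \infty} \frac{1}{N} \sum_{n=1}^{N} \mathrm{Var}[(\hat{a}_n - a_n)(\hat{c}_n - c_n)] &= 0, \text{ and} \\
\lim_{M \to \infty} \frac{1}{N} \sum_{n=1}^{N} \mathrm{Var}[(\hat{b}_n - b_n)(\hat{c}_n - c_n)] &= 0.
\end{aligned}
$$

(A3) $(\hat{a}_n - a_n)/\sigma_{an}$, $(\hat{b}_n - b_n)/\sigma_{bn}$, and $(\hat{c}_n - c_n)/\sigma_{cn}$ have uniformly bounded fourth moments.

(A4) For any fixed $\theta$, there exists $c_0(\theta) > 0$ such that $\liminf_{N \to \infty} I(\theta)/N > c_0(\theta) > 0$.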

The details of these regularity conditions can be found in Zhang and Lu (2007). 

In the following theorem, the notation $o_p(\cdot)$ is needed. If $F_N = G_N + o_p(H_N)$, it means that
$(F_N - G_N)/H_N$ converges to zero in probability (Serfling, 1980).


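Theorem (Zhang, Xie, Song, & Lu, 2007)

Suppose that $\hat{\theta}_w$ is the regular WLE of $\theta$ and satisfies Equation 15, where the estimated item
parameters, the $\hat{\beta}$'s, are regarded as fixed and known. Assume that the regularity conditions
(A0)-(A4) hold. Then

$$\hat{\theta}_w = \theta + [J(\theta) + Q(\theta) + Z(\theta) - B(\theta)I(\theta)]/I(\theta) + o_p(\sigma^2_{\max}), \tag{16}$$

where $I(\theta)$ is the test-information function given by Equation 11, $B(\theta)$ is given by Equation 12,
and

$$
\begin{aligned}
J_1(\theta) &= -1.7^2 \sum_{n=1}^{N} (\theta - b_n)(1 - c_n) P_n(\theta) Q_n(\theta) K_n(\theta) \\
&\qquad \times \{1.7a_n(\theta - b_n)[0.5 - P_n(\theta) + c_n L_n(\theta)] + 1\}\,(\sigma^2_{an} + \delta^2_{an}), \\
J_2(\theta) &= -1.7^3 \sum_{n=1}^{N} a_n^3 (1 - c_n) P_n(\theta) Q_n(\theta) K_n(\theta)\,[0.5 - P_n(\theta) + c_n L_n(\theta)]\,(\sigma^2_{bn} + \delta^2_{bn}), \\
J_3(\theta) &= 1.7^2 \sum_{n=1}^{N} 2a_n (1 - c_n) P_n(\theta) Q_n(\theta) K_n(\theta) \\
&\qquad \times \{1.7a_n(\theta - b_n)[0.5 - P_n(\theta) + c_n L_n(\theta)] + 1\}\,(\sigma_{abn} + \delta_{an}\delta_{bn}), \\
J_4(\theta) &= 1.7 \sum_{n=1,\, c_n > 0}^{N} a_n Q_n(\theta) K_n(\theta) L_n(\theta)\,(\sigma^2_{cn} + \delta^2_{cn}), \\
J_5(\theta) &= 1.7 \sum_{n=1,\, c_n > 0}^{N} Q_n(\theta) K_n(\theta)\,\{1.7a_n(\theta - b_n)[1 - 2c_n L_n(\theta)] - 1\}\,(\sigma_{acn} + \delta_{an}\delta_{cn}), \\
J_6(\theta) &= -1.7^2 \sum_{n=1,\, c_n > 0}^{N} a_n^2 Q_n(\theta) K_n(\theta)\,[1 - 2c_n L_n(\theta)]\,(\sigma_{bcn} + \delta_{bn}\delta_{cn}), \\
J(\theta) &= J_1(\theta) + J_2(\theta) + J_3(\theta) + J_4(\theta) + J_5(\theta) + J_6(\theta), \\
Q_1(\theta) &= -1.7^2 \sum_{n=1}^{N} a_n(\theta - b_n)(1 - c_n) P_n(\theta) Q_n(\theta) K_n(\theta)\,\delta_{an}, \\
Q_2(\theta) &= 1.7^2 \sum_{n=1}^{N} a_n^2 (1 - c_n) P_n(\theta) Q_n(\theta) K_n(\theta)\,\delta_{bn}, \\
Q_3(\theta) &= -1.7 \sum_{n=1,\, c_n > 0}^{N} a_n Q_n(\theta) K_n(\theta)\,\delta_{cn}, \\
Q(\theta) &= Q_1(\theta) + Q_2(\theta) + Q_3(\theta), \text{ and} \\
Z(\theta) &= 1.7 \sum_{n=1}^{N} a_n K_n(\theta)\,(y_n - F_n(\theta)).
\end{aligned}
$$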

This theorem gives the error terms of bias in $\hat{\theta}_w$, which are obtained by treating the $\hat{\beta}$'s
estimated from a calibration sample as though they are the true values when they actually have
associated statistical errors. According to Equation 16, the total bias of $\hat{\theta}_w$ is given by

$$[J(\theta) + Q(\theta) + Z(\theta) - B(\theta)I(\theta)]/I(\theta),$$

which is the sum of (a) the bias of $\hat{\theta}_w$ given the $\hat{\beta}$'s as the true values, which equals
$[Z(\theta) - B(\theta)I(\theta)]/I(\theta)$; and (b) the bias of substituting the $\hat{\beta}$'s for the $\beta$'s, which is $[J(\theta) + Q(\theta)]/I(\theta)$.
When the theorem is applied to a practical situation and $\theta$ is estimated by $\hat{\theta}_w$, $I(\theta)$, $J(\theta)$, $Q(\theta)$,
$Z(\theta)$, and $B(\theta)$ have to be replaced by their estimates, $\hat{I}(\hat{\theta}_w)$, $\hat{J}(\hat{\theta}_w)$, $\hat{Q}(\hat{\theta}_w)$, $\hat{Z}(\hat{\theta}_w)$, and $\hat{B}(\hat{\theta}_w)$,
respectively. It is clear that $\hat{Z}(\hat{\theta}_w) - \hat{B}(\hat{\theta}_w)\hat{I}(\hat{\theta}_w) = 0$ from Equation 15. Accordingly, the new
bias-corrected ability estimator is defined as

$$\hat{\theta}_{wc} = \hat{\theta}_w - [\hat{J}(\hat{\theta}_w) + \hat{Q}(\hat{\theta}_w)]/\hat{I}(\hat{\theta}_w), \tag{17}$$

where $\hat{\theta}_{wc}$ denotes the CWLE and $\hat{J}$, $\hat{Q}$, and $\hat{I}$ are all evaluated at $\hat{\theta}_w$. Apparently, the CWLE
works best when there is a reasonable $\hat{\theta}_w$ to start from. (This will be demonstrated in Section 3.)
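The correction in Equation 17 can be sketched in Python as below. Two points are assumptions rather than statements from the text: the function $L_n(\theta)$ is not defined in the text as it survives here, so the sketch takes $L_n(\theta) = \exp\{-1.7a_n(\theta - b_n)\}K_n(\theta)$, which is the form obtained when $K_n$ is differentiated with respect to $c_n$; and the data layout (tuples of estimates, biases, and variances/covariances per item) and the function names are hypothetical.

```python
import math

D = 1.7


def cwle_correction(theta, est, bias, cov):
    """Return [J(theta) + Q(theta)] / I(theta) from the theorem.

    est  : list of (a, b, c) estimated item parameters
    bias : list of (da, db, dc) item-parameter biases
    cov  : list of (va, vb, vc, cab, cac, cbc) variances and covariances
    """
    info = j = q = 0.0
    for (a, b, c), (da, db, dc), (va, vb, vc, cab, cac, cbc) in zip(est, bias, cov):
        t = theta - b
        u = math.exp(-D * a * t)
        p = 1.0 / (1.0 + u)       # P_n
        qn = 1.0 - p              # Q_n
        k = 1.0 / (1.0 + c * u)   # K_n
        ln = u * k                # assumed form of L_n (see lead-in)
        g = 0.5 - p + c * ln
        h = D * a * t * g + 1.0
        pqk = (1.0 - c) * p * qn * k
        info += D**2 * a**2 * pqk
        j += -D**2 * t * pqk * h * (va + da**2)            # J1
        j += -D**3 * a**3 * pqk * g * (vb + db**2)         # J2
        j += D**2 * 2.0 * a * pqk * h * (cab + da * db)    # J3
        q += -D**2 * a * t * pqk * da                      # Q1
        q += D**2 * a**2 * pqk * db                        # Q2
        if c > 0:
            m = 1.0 - 2.0 * c * ln
            j += D * a * qn * k * ln * (vc + dc**2)                    # J4
            j += D * qn * k * (D * a * t * m - 1.0) * (cac + da * dc)  # J5
            j += -D**2 * a**2 * qn * k * m * (cbc + db * dc)           # J6
            q += -D * a * qn * k * dc                                  # Q3
    return (j + q) / info


def cwle(theta_w, est, bias, cov):
    """Equation 17: subtract the estimated correction from the WLE."""
    return theta_w - cwle_correction(theta_w, est, bias, cov)
```

A simple consistency check: if every difficulty estimate is shifted by the same small amount and all other error terms are zero, the first-order term $Q_2/I$ equals that shift exactly, so the correction should be close to the shift.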


3 Simulation Studies 


3.1 Design 

Some simulation studies were conducted to allow for a comparison of the ERFs and CWLE. 
For all studies, two test lengths ($N = 35$ and $N = 70$) are considered, and all items are modeled by
3PL models. The real item parameters are generated marginally in Matlab (The MathWorks,
Inc., 2007): Log($a$) follows a normal distribution with mean -0.1 and standard deviation 0.4,
$b$ follows a normal distribution with mean 0 and standard deviation 1.5, and $c$ follows a beta
distribution with mean 0.22 and standard deviation 0.05. The true $\beta$'s are shown in Table 1.
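The marginal generation of true item parameters can be imitated with Python's standard library. This is a sketch, not the Matlab code used in the study: it assumes that Log denotes the natural logarithm, it converts the stated beta mean and standard deviation to shape parameters by moment matching, and the function names are hypothetical. It will not reproduce the specific values in Table 1.

```python
import math
import random


def beta_shape_from_moments(mean, sd):
    """Moment-match a beta distribution: return (alpha, beta)."""
    nu = mean * (1.0 - mean) / sd**2 - 1.0
    return mean * nu, (1.0 - mean) * nu


def simulate_true_items(n_items, rng):
    """Draw (a, b, c) marginally, following the distributions in the text."""
    alpha, beta = beta_shape_from_moments(0.22, 0.05)
    items = []
    for _ in range(n_items):
        a = math.exp(rng.gauss(-0.1, 0.4))  # log(a) ~ N(-0.1, 0.4^2), natural log assumed
        b = rng.gauss(0.0, 1.5)             # b ~ N(0, 1.5^2)
        c = rng.betavariate(alpha, beta)    # c ~ Beta with mean 0.22, SD 0.05
        items.append((a, b, c))
    return items


items = simulate_true_items(70, random.Random(2007))
```

With mean 0.22 and standard deviation 0.05, moment matching gives shape parameters of roughly 14.88 and 52.76.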

Our goal is to examine the performance of ERFs and CWLE with different levels of
measurement error in $\beta$ estimation from the calibration sample. The use of different software
or calibration sample sizes would yield different levels of measurement error, and it is hard to
enumerate all possible combinations of these conditions. For example, large bias in $\beta$ estimation
may result from bad software, a calibration sample size that is too small, or both. To have
better control on the magnitude of these measurement errors, the $\delta_a$'s, $\delta_b$'s, and $\delta_c$'s are generated in
Matlab. Three levels of measurement error in $\beta$ estimation are considered: large bias, median
bias, and negligible bias, which can be compared to the operational scenarios where the $\beta$'s are
estimated from a calibration sample of size < 500, 1,000, and > 5,000, respectively. (We assume
that the larger the calibration sample size, the better the item parameter estimation, regardless of
the software used.) Without loss of generality, the triplet $(\delta_a, \delta_b, \delta_c)$ for each item is assumed to
follow a multivariate normal distribution, the mean and variances/covariances of which vary from
item to item. Besides, it is natural to assume that the triplets for different items are independent.
Each time we generate a set of $\{(\delta_{an}, \delta_{bn}, \delta_{cn}) : n = 1, \ldots, N\}$ for a total of $N$ items, which are
treated as the item bias estimates, and the estimated item parameters $(\hat{a}_n, \hat{b}_n, \hat{c}_n)$ for the $n$th item
are calculated by

$$\hat{a}_n = a_n + \delta_{an}, \quad \hat{b}_n = b_n + \delta_{bn}, \quad \text{and} \quad \hat{c}_n = c_n + \delta_{cn}.$$
Table 1

Simulated True Item Parameters for the 70-Item Test

Item        a        b        c    Item        a        b        c
   1   1.2145   0.0723   0.2835      36   0.8332   0.6681   0.3200
   2   0.9932   2.4498   0.2012      37   0.9514   2.6131   0.3561
   3   0.5657  -0.6576   0.3064      38   0.6048   1.1258   0.2698
   4   1.2740   1.7239   0.2582      39   0.7591  -1.0534   0.2331
   5   0.7467   1.3533   0.2636      40   0.7553  -0.0035   0.2699
   6   2.2014  -0.0363   0.2350      41   0.6303  -2.1020   0.2121
   7   0.6670   1.0233   0.1862      42   0.8788   1.8163   0.3153
   8   0.8617   0.7997   0.2024      43   0.6037  -1.1131   0.2983
   9   0.9157  -0.0487   0.2832      44   1.0568  -0.3722   0.2236
  10   0.9063   1.2036   0.3496      45   0.5610  -1.4502   0.2884
  11   0.4602  -1.7750   0.2747      46   1.5003  -1.9449   0.2368
  12   0.8763   1.6128   0.4302      47   0.6732  -2.0891   0.2933
  13   0.6917   3.3107   0.2828      48   0.6824   0.0053   0.3418
  14   1.2359   0.3697   0.2755      49   0.9529   2.6770   0.2407
  15   0.9117  -0.9586   0.3024      50   0.8636  -3.4394   0.3951
  16   1.5037  -0.8225   0.3110      51   1.0132  -2.3428   0.2354
  17   0.8800  -0.4547   0.3925      52   0.7933   0.2393   0.2555
  18   0.6777  -1.9274   0.3303      53   0.6666  -0.0767   0.2359
  19   1.7620   0.6543   0.2772      54   0.9970   3.0302   0.1813
  20   1.0723   0.4940   0.2354      55   1.2376  -3.5479   0.3135
  21   0.9151   0.5579   0.2104      56   0.7182  -1.2628   0.2447
  22   1.8462   0.2401   0.2663      57   0.9063  -2.4380   0.2705
  23   1.6068   0.0391   0.2892      58   0.5784  -2.7701   0.2560
  24   0.9381   0.4437   0.3172      59   1.5682  -0.7267   0.2733
  25   0.6205  -0.0567   0.3358      60   1.0389  -1.0317   0.2629
  26   1.5043   0.2674   0.2842      61   1.0407  -0.9062   0.2596
  27   1.6106   0.2059   0.3404      62   0.6053  -1.2803   0.2780
  28   1.1303   0.6481   0.2471      63   0.9283  -0.3579   0.3197
  29   1.3088   0.0140   0.1752      64   1.9090  -1.5824   0.3210
  30   0.7699   0.7510   0.2383      65   0.8335  -0.7725   0.2208
  31   0.9393  -2.2205   0.1912      66   0.7925   2.4777   0.2431
  32   1.3867  -0.1077   0.2963      67   0.5538   1.1808   0.2504
  33   0.8239   1.8788   0.2107      68   1.3536  -0.5254   0.2669
  34   1.0747  -1.3066   0.2006      69   1.1407   0.8589   0.2630
  35   1.9959   0.9734   0.2024      70   1.2149  -0.2157   0.2677

Note. The true item parameters for the 35-item test are the first 35 items.


The above procedure is repeated 100 times, and the sample variances/covariances are used to
approximate the true variance/covariance matrix of the $\hat{\beta}$'s. In other words, there is no restriction
on the correlation between any two estimated item parameters of the same item. The correlation
can be positive or negative, depending completely on the resulting estimated item parameters.
Table 2 presents the mean simulated bias of the $\hat{\beta}$'s based on 100 replications. The whole
procedure, which is the calibration step, is replicated for each test length.


In the target sample, 13 ability levels are set up with 100 examinees at each level for
each replication. The estimated item parameters obtained at the previous stage are fixed to
obtain the $\hat{\theta}_w$. The CWLEs are then calculated from Equation 17 using the simulated biases and
the estimated variance/covariance matrices. As for the ERFs, the posterior distribution has to
be determined. Approximating the transformed item parameters $\beta^* = (\log(a), b, \mathrm{logit}(c))$ by a
multivariate normal distribution, the delta method is applied to get the posterior distribution of
$\beta$. This process was repeated 100 times to match 100 replications in the calibration step for each
test length. As a result, there are 10,000 examinees in total at each of 13 ability levels.


Table 2

Mean Simulated Bias of Estimated Item Parameters Based on 100 Replications

              Large bias                 Median bias               Negligible bias
Item        a        b        c        a        b        c        a        b        c
   1  -0.0673  -0.1418  -0.0042   0.0963  -0.1088   0.0233  -0.0088   0.0021  -0.0010
   2  -0.0725  -0.1037   0.0232  -0.0147  -0.0974  -0.0296   0.0022   0.0092  -0.0047
   3   0.0520  -0.0514  -0.0654   0.0655   0.0147  -0.0010  -0.0113  -0.0144  -0.0002
   4   0.0020  -0.0123  -0.0147   0.0593  -0.0047   0.0404   0.0014  -0.0160   0.0020
   5   0.0687  -0.0504  -0.0403  -0.1160  -0.0620   0.0235   0.0083  -0.0096  -0.0031
   6  -0.0521  -0.0860  -0.0286  -0.0552   0.0431  -0.0333  -0.0033  -0.0107   0.0002
   7  -0.0172  -0.1482  -0.0015  -0.1807  -0.0656  -0.0238   0.0081   0.0018  -0.0021
   8  -0.0178  -0.0028   0.0033  -0.0541  -0.1072  -0.0578   0.0027  -0.0012   0.0084
   9  -0.0743  -0.0716   0.0037   0.1408  -0.0628  -0.0069   0.0132  -0.0017  -0.0007
  10  -0.0509  -0.0718  -0.0491  -0.0585  -0.1190  -0.0099  -0.0137  -0.0030  -0.0057
  11  -0.0827   0.0238  -0.0088   0.0145   0.0219   0.0184  -0.0020  -0.0006  -0.0065
  12  -0.1175  -0.1352  -0.0160  -0.0613   0.0381   0.0085   0.0104  -0.0040   0.0010
  13   0.0185  -0.0133   0.0127   0.1680  -0.0603  -0.0003   0.0046  -0.0050  -0.0031
  14  -0.0936  -0.1151  -0.0103   0.0231  -0.0888  -0.0037   0.0011  -0.0071   0.0006
  15  -0.0236  -0.2164  -0.0191  -0.1471  -0.0322   0.0289  -0.0093  -0.0207  -0.0033
  16  -0.0368  -0.0676  -0.0000  -0.0146  -0.0708   0.0024  -0.0040  -0.0046   0.0067
  17   0.0098  -0.1156  -0.0495   0.0416  -0.1104  -0.0523  -0.0038   0.0023  -0.0066
  18  -0.0244  -0.1088  -0.0248   0.0869  -0.0128  -0.0244   0.0080   0.0038   0.0071
  19  -0.0000  -0.1935  -0.0343   0.1507  -0.0266   0.0333  -0.0089  -0.0110   0.0028
  20  -0.0343  -0.2045  -0.0079   0.0570  -0.1056  -0.0053   0.0002  -0.0073  -0.0077
  21   0.0039  -0.0713   0.0184  -0.0854  -0.0125  -0.0026  -0.0021   0.0013   0.0029
  22   0.0417  -0.0951  -0.0464   0.0237  -0.0987  -0.0049  -0.0001   0.0012  -0.0011
  23  -0.1374  -0.1076   0.0087  -0.0530  -0.0785   0.0077   0.0061   0.0007   0.0049
  24  -0.0566  -0.1159  -0.0492  -0.0525   0.0533  -0.0105   0.0007  -0.0011   0.0015
  25  -0.0532  -0.0820  -0.0257  -0.0842  -0.1187   0.0292  -0.0054  -0.0102  -0.0034
  26  -0.1202  -0.0441  -0.0444   0.0794  -0.0643   0.0072   0.0006  -0.0175  -0.0001
  27   0.0071  -0.1400  -0.0193  -0.1645   0.0070  -0.0058   0.0056  -0.0002   0.0024
  28   0.0358  -0.0753  -0.0265  -0.0791  -0.0265  -0.0312   0.0041  -0.0080  -0.0075
  29  -0.0828  -0.1192   0.0437   0.0750  -0.0090  -0.0054   0.0013  -0.0110  -0.0022
  30  -0.0071  -0.0240   0.0028   0.1542  -0.1247  -0.0480  -0.0019   0.0055  -0.0023
  31   0.0383  -0.0165   0.0117  -0.0681  -0.0863   0.0214  -0.0101  -0.0129  -0.0065
  32  -0.0517  -0.0264  -0.0211   0.0149   0.0250  -0.0204   0.0037   0.0058   0.0025
  33  -0.0508  -0.1032   0.0232   0.1537  -0.1540   0.0165  -0.0018   0.0041   0.0002
  34  -0.0232  -0.1583  -0.0337   0.1686  -0.0819  -0.0107   0.0023  -0.0016   0.0035
  35  -0.0221  -0.1443   0.0036   0.0151  -0.1214  -0.0112   0.0012  -0.0054  -0.0037

In our studies, the ability estimates of examinees who answered all the items right or
wrong were set to be 4 or -4, respectively. In addition, we defined $\hat{\theta}_{wc} = \hat{\theta}_w$ when $\hat{\theta}_w = 4$ or $-4$.


3.2 Evaluation 

The accuracy and precision are studied by examining the bias and root mean square error
(RMSE) of WLE, CWLE, and ERFs at each ability level. Let $\hat{\theta}_j$ and $\theta_j$ denote the estimated and
true abilities for the $j$th examinee at the $L$th ability level, $j = 1, \ldots, 10{,}000$ and $L = 1, \ldots, 13$.
Then,

$$\mathrm{Bias}_L = \frac{1}{10000} \sum_{j=1}^{10000} (\hat{\theta}_j - \theta_j), \tag{18}$$

and

$$\mathrm{RMSE}_L = \sqrt{\frac{1}{10000} \sum_{j=1}^{10000} (\hat{\theta}_j - \theta_j)^2}. \tag{19}$$

The $\hat{\theta}_j$ is an estimate resulting from the method of WLE, CWLE, or ERFs.
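Equations 18 and 19 translate directly into a few lines of Python. The function name and the toy inputs below are illustrative assumptions; in the study each ability level has 10,000 estimates.

```python
import math


def bias_and_rmse(estimates, truths):
    """Bias (Equation 18) and RMSE (Equation 19) for one ability level."""
    errors = [est - true for est, true in zip(estimates, truths)]
    n = len(errors)
    bias = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return bias, rmse


# Two estimates that miss a true ability of 1.0 by -0.5 and +0.5.
print(bias_and_rmse([0.5, 1.5], [1.0, 1.0]))  # (0.0, 0.5)
```

The toy example shows why both measures are reported: errors of opposite sign cancel in the bias but not in the RMSE.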


3.3 Results 

Figure 1 presents the results of the bias and RMSE of ability estimates for two different
test lengths with large bias in $\hat{\beta}$. The top panel shows that, for the 35-item test, the CWLE
is better in reducing bias than WLE for $\theta$ values higher than -1.5. The bias of the CWLE is
fairly close to zero for all $\theta$ values higher than -1. On the other hand, ERFs do not appear to
work well in reducing bias except for the median $\theta$ values. Neither ERFs nor CWLE could reduce
the RMSE. The graphs in the bottom panel make obvious that longer tests (i.e., $N = 70$) lead
to better ability estimation, in terms of bias and RMSE, when the WLE or CWLE is applied.
The maximum bias of the WLE is less than 0.1, while the bias of the CWLE stays around zero
for almost all $\theta$ values. This means the CWLE performs best when the bias of WLE is within
a reasonable range. The bias for the ERFs is not significantly influenced by the change of test
lengths because its range is [-0.2, 0.2] in both situations, which agrees with the findings in Lewis
(2001) that “the greater test lengths, at least in the $\theta$ range studied, do not appear to reduce the
relative effects of increasing uncertainty about the item parameters.” However, the RMSE of the
ERFs does decrease for longer tests.

Figure 1. Bias and root mean square error of ability estimates for two test lengths;
large bias in item parameter estimation. (Panels: N=35, Large Bias; N=70, Large Bias.)

Figure 2 illustrates the relationship between the magnitude of measurement error and the
effectiveness of the two bias-correction methods. The left panel indicates that the bias of WLE
approaches zero for $\theta$ values higher than -2, when the amount of measurement error decreases
towards zero. This observation verifies that the WLE performs well as long as the measurement
error in $\hat{\beta}$ is negligible. CWLE still performs well in reducing bias with median or greater amounts
of measurement error, but the difference between CWLE and WLE decreases as the measurement
error decreases. The two curves overlap when the bias of $\hat{\beta}$ becomes negligible. As expected, the
curves of ERFs are almost identical in all cases; the bias is within [-0.2, 0.2] for all $\theta$ values. The
possible noise is averaged out beforehand, even if the true $\beta$ can be regarded as known. The
right panel shows that the RMSE of the CWLE tends to the RMSE of the WLE when the amount
of measurement error approaches zero. The RMSE of the ERFs stays the same for all cases.

It is worth noting that, in both figures, when the $\theta$ value is lower than -2, CWLE does
not appear to work well for all 35-item tests in the sense that it does not reduce the bias of WLE.
One possible explanation is that under our experimental conditions $\hat{\theta}_w$ is not accurate enough for
extremely low $\theta$ values. The CWLE is a bias-reduction method based on the WLE and is applicable
when the WLE can provide a good estimate. Adopting the bias-correction procedure in Equation
17 without paying attention to the results of the WLE could lead to a worse $\theta$ estimate. On the
other hand, the ERFs method does not rely on any other estimation procedure.

4 Conclusions 

Several conclusions can be drawn concerning the use of the two bias-correction procedures
for $\theta$ estimation in IRT models. First, both methods are easy to implement, and their underlying
theories are intuitive (i.e., the CWLE corrects bias by removing it from the biased estimate, while
the ERFs correct it by averaging out the noise). Second, the CWLE can effectively reduce the
bias in $\theta$ estimation even when the bias in $\hat{\beta}$ is large, provided that reasonable results of WLE are
found to initiate the bias-correction procedure given by Equation 17. Finally, the ERFs can cause
the bias in $\theta$ estimation to fall within [-0.2, 0.2], regardless of the test length and the magnitude
of bias in item parameter estimation.

Although the last two conclusions are based on simulation studies, there is no reason 
to expect different results for tests with operational characteristics similar to the experimental 
conditions. As discussed in Section 3, measurement errors in item parameter estimation are 
generated in our studies, and there is no restriction on the correlation between any two estimated 
item parameters of the same item. Further empirical studies are needed to see how different 
degrees of association between estimated item parameters will affect the efficiency of these two 
methods in reducing the bias in ability estimation. 